ABSTRACT

The literature search has always been an important part
of academic research. It greatly helps to improve the
quality of the research process and output, and increase the 
efficiency of the researchers in terms of their novel contribu- 
tion to science. As the number of published papers increases 
every year, a manual search becomes more exhaustive even 
with the help of today's search engines since they are not 
specialized for this task. In academics, two relevant papers 
do not always have to share keywords, cite one another, or 
even be in the same field. Although a well-known paper is
usually an easy prey in such a hunt, relevant papers using
different terminology, especially recent ones, are not obvious
to the eye.

In this work, we propose paper recommendation algo- 
rithms by using the citation information among papers. The 
proposed algorithms are direction aware in the sense that 
they can be tuned to find either recent or traditional pa- 
pers. The algorithms require a set of papers as input and 
recommend a set of related ones. If the user wants to give 
negative or positive feedback on the suggested paper set, the 
recommendation is refined. The search process can be easily 
guided in that sense by relevance feedback. We show that 
this slight guidance helps the user to reach a desired paper 
in a more efficient way. We adapt our models and algorithms 
also for the venue and reviewer recommendation tasks. Ac- 
curacy of the models and algorithms is thoroughly evaluated 
by comparison with multiple baselines and algorithms from 
the literature in terms of several objectives specific to ci- 
tation, venue, and reviewer recommendation tasks. All of 
these algorithms are implemented within a publicly available
web-service framework which currently uses the data
from DBLP and CiteSeer to construct the proposed citation
graph.

1. INTRODUCTION

The academic community has published millions of re- 
search papers to date and the number of new papers has 
been increasing with time. For example, based on DBLP,
computer scientists published 3 times more papers in 2010
than in 2000 (see Figure 1, left). With more than one hundred
thousand new papers each year, performing a complete
literature search has become a herculean task. A paper cites
on average 20 other papers (see Figure 1, right), which means
that there might be more than a thousand papers that cite
or are cited by any paper a researcher writes. Researchers
typically rely on manual methods to discover new research,
such as keyword-based search on search engines, reading
proceedings of conferences, browsing the publication lists of
known experts, or checking the reference lists of papers they
are interested in. These techniques are time-consuming and
only allow reaching a limited set of documents in a reasonable
time. Developing tools that help researchers find unknown
and relevant papers will certainly increase the productivity 
of the scientific community. 

Some of the existing approaches and tools for the litera- 
ture search cannot compete with the size of today's litera- 
ture. Keyword-based approaches suffer from the confusion 
induced by different names of identical concepts in different
fields. (For instance, a partially ordered set, or poset, is
also often called a directed acyclic graph, or DAG.) Hence, a
researcher may not be able to find the right paper even if she
is advised to scan a long list of papers by a keyword-based
approach. Conversely, two different concepts may
have the same name in different fields (for instance, hybrid 
is commonly used to specify software hybridization, hard- 
ware hybridization or algorithmic hybridization) and such 
homonyms may drastically increase the number of suggested 
but unrelated papers. Some publishers and digital libraries 
automatically suggest papers to authors; however, their sug- 
gestions are usually based on the publication history of the 
researcher, which may not match her current interests.

Figure 1: Number of new papers published each year
based on DBLP (left), and number of papers with
given citation and reference count (right).

To achieve this goal, we built a publicly available web
service called theadvisor. It takes a bibliography file
containing a set of papers, i.e., seeds, as an input to initiate the
search. The user can specify whether she is interested in
classical papers or in recent papers. Then, the service returns
a set of suggested papers ordered with respect to a ranking
function. The user can guide the search or prune the list
of suggested papers with positive or negative feedback by
declaring a subset relevant or irrelevant. In this case, the
service refines the whole set and shows the new results
to the researcher. In addition to papers, the service also
suggests researchers or experts, and conferences or journals
of interest. We believe that it will be a valuable asset to a
researcher while performing several tasks, such as:

• searching the literature on any topic she is interested in,

• finding recent or traditional papers related to a problem, 

• improving the reference list of a manuscript being written, 

• finding conferences and journals for attendance, subscrip- 
tion, or paper submission, 

• finding a set of researchers in a field of interest to follow 
their work, 

• finding a list of potential reviewers, which is required by 
certain journals in the submission process. 

The service uses the bibliographical information while sug- 
gesting relevant papers, venues, and people to the researcher. 
For each paper, it uses the authorship and venue informa- 
tion in addition to the list of papers it cites. The service 
works on a modified version of the citation graph which is 
constructed by using this information. In other words, the 
service recommends papers, experts, and venues using citation
analysis. We do not take the textual data into account
because our aim is finding all conceptually related and
high-quality documents even if they use a different terminology. It
has been shown that text-based similarity is not sufficient for
this task and that most of the relevant information is
contained within the citation graph [24]. Besides, it is plausible
that there is already a correlation between the citation
similarities and the text similarities of the papers [21].

Our aim in this work is to evaluate the existing algorithms 
and to explain the new algorithms that power our service. 
We distinguish two types of algorithms in the literature. 
Some algorithms (such as Cocitation [23], Cocoupling [9]
and CCIDF [11]) only use direct citations and references of
the seed papers. Other methods (such as PaperRank [5]
and Katz [24]) perform a deep search of the citation graph
by traversing all its edges; they are often said to be
eigenvector based. However, none of these methods explicitly
allows searching the paper space for either old or recent
papers.



In this work, we present the class of direction aware
algorithms. They feature a parameter which allows giving more
importance to either the citations of papers or their
references. This parameter makes the citation suggestion process
easily tunable for finding either recent or traditional relevant
papers. In particular, we extend two eigenvector based
methods into direction aware algorithms, namely DaRWR and
DaKatz.

This paper presents an evaluation of the existing and
proposed algorithms for citation recommendation in the
light of link prediction and citation patterns. We also
investigate the potential of the positive and negative feedback
mechanism our service exposes. Finally, we show that
citation recommendation can be used to recommend venues
and reviewers better than methods commonly used by
researchers.

The paper is organized as follows: In Section 2, we briefly
survey related work. The problems and the
methods are formally presented in Section 3. The accuracy
of the methods is experimentally analyzed in Section 4.
Section 5 discusses future work and concludes the paper.

2. RELATED WORK 

Citation analysis has been successfully used for various
tasks including expert finding [1], academic evaluation
of researchers, conferences, journals and papers [3] [7],
context-aware citation recommendation [6], and impact
prediction [22].

There are various citation analysis-based paper
recommendation methods depending on a pairwise similarity
measure between two papers. Bibliographic coupling, which is
one of the earliest works, considers papers having similar
citations as related [9]. Another early work, the Cocitation
method, considers papers which are cited by the same
papers as related [23]. A similar cites/cited approach using
collaborative filtering is proposed by McNee et al. [18].
Another method, common citation × inverse document
frequency (CCIDF), also considers only common citations, but
weights them with respect to their inverse frequencies [11].

More recent works define different measures such as Katz,
which was used by Liben-Nowell and Kleinberg in a study
on the link prediction problem on social networks [15] and
later used for information retrieval purposes including
citation recommendation by Strohman et al. [24]. For two
papers in the citation network, the Katz measure counts the
number of paths by favoring the shorter ones. Lu et al.
stated that both bibliographic coupling and Cocitation
methods are only suitable for special cases due to their very
local nature [16]. They proposed a method which computes
the similarity of two papers by using a vector based
representation of their neighborhoods in the citation network and
compared the method with CCIDF. Liang et al. argued that
most of the methods stated above consider only direct
references and citations. Even Katz and the vector
based method of [16] consider the links in the citation
network as simple links. Instead, Liang et al. added a weight
attribute to each link and proposed the method Global
Relation Strength, which computes the similarity of two papers
by using a Katz-like approach.

Many works use random walk with restarts (RWR) for
citation analysis [5] [17] [13] [10]. RWR is a well known and
efficient technique used for different tasks including
computing the relevance of two vertices in a graph [19]. It is very
similar to the well known PageRank algorithm, which is used
by both Li and Willett [13] (ArticleRank) and Ma et al. [17]
to evaluate the importance of academic papers. Gori and
Pucci [5] proposed the PaperRank algorithm for RWR-based
paper recommendation, which can also be seen as a
Personalized PageRank computation [8] on the citation graph.
Lao and Cohen [10] also used RWR for paper recommendation
in citation networks and proposed a learnable proximity
measure for weighting the edges by using machine learning
techniques.

As far as we know, none of these works studies the
recent/traditional paper recommendation problem. The closest
work is Claper [25], which is an automatic system that
measures how classical a paper is, allowing one to rank a
list of papers to highlight the most classical ones.

3. PROBLEMS AND METHODS 

Let $G = (V, E)$ be the citation graph, with $n$ papers $V =
\{v_1, \ldots, v_n\}$. In $G$, each directed edge $e = (v_i, v_j) \in E$
represents a citation from $v_i$ to $v_j$. For the rest of the paper,
we use the phrases "references of $v$" and "citations to $v$"
to describe the graph around vertex $v$ (see Figure 2). We
use $\deg^{-}(v)$ and $\deg^{+}(v)$ to denote the number of references
of and citations to $v$, respectively.



Figure 2: Citation graph around a paper $v$, with
references and citing papers (time axis: 2001 to 2006).



In this work, we consider three query types: 

• Paper recommendation (PR): Given a set of $m$
seed papers $M = \{p_1, \ldots, p_m\}$ s.t. $M \subseteq V$, and a
parameter $k$, return top-$k$ papers which are relevant to the
ones in $M$.

• Venue recommendation (VR): Given a set of $m$
seed papers $M = \{p_1, \ldots, p_m\}$ and a parameter $k$,
return top-$k$ venues related to the papers in $M$.

• Expert recommendation (ER): Given a set of $m$
seed papers $M = \{p_1, \ldots, p_m\}$ and a parameter $k$,
return top-$k$ experts studying topics related to the
papers in $M$.

These query definitions are generic. They can be used for 
various academic tasks by the researchers. In this paper, we 
target the manuscript preparation and submission process 
since all of the queries above are useful in this process:
executing a PR query is a very efficient way of finding overlooked
citations in a manuscript with the cited papers as the
input $M$. VR queries are useful while deciding the conference
or journal for submission. And ER queries are useful while 
submitting a manuscript to some journals which require a 
set of names of potential reviewers. 



3.1 Citation recommendation 

3.1.1 Random walk with restart 

PaperRank is based on random walks in the citation 
graph G. The current structure of G is not suitable for find- 
ing recent and relevant papers since such papers have only a 
few incoming edges. Moreover, since the graph is acyclic, all 
random walks will end up on old papers. To alleviate this, 
given a PR query with inputs $M$ and $k$, PaperRank constructs
a directed graph $G' = (V', E')$ by slightly modifying
the citation graph $G$ as follows:

• A source node $s$ is added to the vertex set:

$$V' = V \cup \{s\}$$

• Back-reference edges ($E_b$), the edges from $s$ to seed
papers ($E_f$), and restart edges from $V$ to $s$ ($E_r$) are
added to the graph:

$$E_b = \{(y, x) : (x, y) \in E\}$$
$$E_f = \{(s, v) : v \in M\}$$
$$E_r = \{(v, s) : v \in V\}$$
$$E' = E \cup E_b \cup E_f \cup E_r$$
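To make the construction concrete, the following is a minimal sketch of how the augmented edge set could be built from a list of citation pairs. The function name `build_augmented_graph`, the string label used for the source node, and the toy graph (papers a and b cited by p1, papers c and d citing p1) are assumptions made for illustration, not part of the service.

```python
def build_augmented_graph(edges, seeds, source="s"):
    """Return (V', E') for a citation graph given as (citing, cited) pairs.

    `source` is a hypothetical label for the added source node s.
    """
    edges = set(edges)
    papers = {p for edge in edges for p in edge} | set(seeds)
    back = {(y, x) for (x, y) in edges}          # E_b: back-reference edges
    from_source = {(source, v) for v in seeds}   # E_f: source to seed papers
    restart = {(v, source) for v in papers}      # E_r: restart edges
    return papers | {source}, edges | back | from_source | restart


# Toy graph: p1 cites a and b; c and d cite p1; p1 is the only seed.
citations = [("p1", "a"), ("p1", "b"), ("c", "p1"), ("d", "p1")]
vertices, augmented = build_augmented_graph(citations, seeds=["p1"])
```

Materializing $E_r$ costs one edge per paper; an implementation could equally keep the restart probability implicit, as the iteration in Equation (3) does.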




Figure 3: Citation graph with source node $s$ and
seed set $M = \{p_1, \ldots, p_m\}$. The papers $a$ and $b$ are
cited by $p_1$, whereas $c$ and $d$ cite $p_1$. Note that there
is a corresponding back-reference edge for every reference edge.


The new directed graph $G'$ has reference (red),
back-reference (dashed), and restart (gray) edges (see Figure 3).
In this model, the random walks are directed towards both
references and citations of the papers. In addition, the
restarts from the source vertex $s$ will be distributed to only
the seed papers in $M$. Hence, random jumps to any paper
in the literature are prevented. We assume that a random
walk that ends in $v$ continues to a neighbor with probability
$d \in (0, 1]$, the damping factor, and with probability $(1 - d)$
it restarts and goes to the source $s$. Let $R_{t-1}(v)$ be the
probability that a random walk ends at vertex $v \neq s$ at
iteration $t - 1$. Let $C_t(v)$ be the contribution of $v$ to one of
its neighbors at iteration $t$. In each iteration, a fraction $d$ of
$R_{t-1}(v)$ is distributed among its references and citations
equally. Hence,



$$C_t(v) = d \, \frac{R_{t-1}(v)}{\deg^{+}(v) + \deg^{-}(v)} \tag{1}$$



Initially, a probability score of 1 is given to the source
node, meaning that a researcher expands the bibliography
starting with the paper itself:

$$R_0(x) = \begin{cases} 1, & \text{if } x = s \\ 0, & \text{otherwise} \end{cases} \tag{2}$$



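where $R_0$ is the probability at $t = 0$. The PaperRank
algorithm computes the probability of a vertex $u$ at iteration
$t$ as

$$R_t(u) = \begin{cases}
(1 - d) \sum_{v \in V} R_{t-1}(v), & \text{if } u = s \\
\sum_{(v,u) \in E \cup E_b} C_t(v) + \frac{R_{t-1}(s)}{|M|}, & \text{if } u \in M \\
\sum_{(v,u) \in E \cup E_b} C_t(v), & \text{otherwise.}
\end{cases} \tag{3}$$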



The PaperRank algorithm converges when the probabilities
of the papers are stable, i.e., when the process is in a
steady state. Let

$$\Delta_t = (R_t(u_1) - R_{t-1}(u_1), \ldots, R_t(u_n) - R_{t-1}(u_n))$$

be the difference vector. We say that the process is in the
steady state when the L2 norm $\|\Delta_t\|$ is smaller than a given
value $\epsilon$. That is,

$$\|\Delta_t\| = \sqrt{\sum_{v \in V} \left(R_t(v) - R_{t-1}(v)\right)^2} < \epsilon.$$



For a given set of initial papers $M$, and parameters $d$ and
$\epsilon$, suppose the algorithm converges.

Definition 1. The relevance score of a paper $u$ with respect
to the seed papers is equal to the steady state probability
$R(u)$.

We choose the top-$k$ non-seed papers with the highest
relevance scores as the initial recommended paper set $\mathcal{R}_{paper}$.
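The iteration defined by Equations (1) to (3), together with the stopping rule on $\|\Delta_t\|$ and the top-$k$ selection, can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the code of the service: the function names, the default values of `eps` and `max_iter`, and the treatment of a paper with no neighbor (its whole score returns to the source) are choices made here.

```python
import math
from collections import defaultdict


def paper_rank(edges, seeds, d=0.75, eps=1e-9, max_iter=1000):
    """Approximate steady-state scores R(u); edges are (citing, cited) pairs."""
    nbrs = defaultdict(list)            # references and citations of each paper
    for citing, cited in edges:
        nbrs[citing].append(cited)      # reference edge (E)
        nbrs[cited].append(citing)      # back-reference edge (E_b)
    seeds = list(seeds)
    papers = set(nbrs) | set(seeds)
    r = dict.fromkeys(papers, 0.0)      # Eq. (2): all mass starts on the source
    r_source = 1.0
    for _ in range(max_iter):
        new = dict.fromkeys(papers, 0.0)
        new_source = 0.0
        for v, mass in r.items():
            if not nbrs[v]:
                new_source += mass      # assumption: an isolated paper restarts
                continue
            new_source += (1 - d) * mass        # restart edge (v, s)
            share = d * mass / len(nbrs[v])     # C_t(v), Eq. (1)
            for u in nbrs[v]:
                new[u] += share
        for u in seeds:
            new[u] += r_source / len(seeds)     # edges from s to the seeds
        delta = math.sqrt(sum((new[v] - r[v]) ** 2 for v in papers))
        r, r_source = new, new_source
        if delta < eps:                 # steady state reached
            break
    return r


def top_k(scores, seeds, k):
    """Top-k non-seed papers by relevance score."""
    seeds = set(seeds)
    candidates = [v for v in scores if v not in seeds]
    return sorted(candidates, key=lambda v: scores[v], reverse=True)[:k]


citations = [("p1", "a"), ("p1", "b"), ("c", "p1"), ("d", "p1")]
scores = paper_rank(citations, seeds=["p1"])
recommended = top_k(scores, seeds=["p1"], k=2)
```

On this toy graph the four neighbors of the seed are symmetric, so they receive the same score; PaperRank alone cannot prefer the citing papers c and d over the cited papers a and b.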

Theorem 1. The PaperRank algorithm converges to a
steady state in a finite number of iterations. Furthermore,
there is only one steady state distribution and hence, the
relevance scores are unique.

PROOF. Consider the subgraph $H = (V_H, E_H) \subseteq G'$
induced by the source $s$ and all vertices reachable from
the source. That is, $V_H = \{u \in V' : R_t(u) > 0\}$ and
$E_H = (V_H \times V_H) \cap E'$. For each $u \in V_H \setminus \{s\}$ there is
a directed edge $(u, s)$ and a directed path $s \rightarrow u$. Hence,
every pair of vertices in $V_H$ is mutually reachable and $H$ is
strongly connected. Thus, the transition matrix of the
corresponding Markov chain is irreducible. Hence, the steady
state exists and is unique. □

3.1.2 Direction aware random walk with restart 

A random walk with restart is a good way to find relevance 
scores of the papers. However, the PaperRank algorithm 
treats the citations and references in the same way. This may 
not lead the researcher to recent and relevant papers if she is
more interested in those. Old and well cited papers have
an advantage with respect to the relevance scores since they
usually have more edges in $G'$. Hence $G'$ tends to have more
and shorter paths from the seed papers to old papers. We
define a direction awareness parameter $\lambda \in [0, 1]$ to obtain
more recent results in the top-$k$ documents. We then define
two types of contributions of each paper $v$ to a neighbor
paper in iteration $t$:



$$C_t^{+}(v) = d\lambda \, \frac{R_{t-1}(v)}{\deg^{+}(v)} \tag{4}$$

$$C_t^{-}(v) = d(1 - \lambda) \, \frac{R_{t-1}(v)}{\deg^{-}(v)} \tag{5}$$



where $C_t^{-}(v)$ is the contribution of $v$ to a paper in its
reference list and $C_t^{+}(v)$ is the contribution of $v$ to a paper which
cites $v$. Hence, for a non-seed, non-source paper $u$,

$$R_t(u) = \sum_{(u,v) \in E} C_t^{+}(v) + \sum_{(v,u) \in E} C_t^{-}(v) \tag{6}$$



For a seed node $u$, $R_t(u)$ is computed similarly except
that each seed node has an additional $\frac{R_{t-1}(s)}{|M|}$ term in the
equation. $R_t(s)$ is computed in the same way as in (3). With this
modification, the parameter $\lambda$ can be used to give more
importance either to traditional papers with $\lambda \in [0, 0.5]$ or
to recent papers with $\lambda \in [0.5, 1]$. We call this algorithm
direction aware random walk with restart (DaRWR).

Note that DaRWR (6) has a probability leak problem
when a paper has no references or citations. If this is the
case, some part of its score will be lost at each iteration. For
such papers, we distribute the whole score from the previous 
iteration towards only its references or citations. 
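A possible reading of Equations (4) to (6), including the fix for the probability leak, is sketched below. The names `darwr`, `lam`, `eps` and `max_iter` are assumptions of this sketch, as is the handling of a paper with neither references nor citations, whose score is simply returned to the source.

```python
import math
from collections import defaultdict


def darwr(edges, seeds, d=0.75, lam=0.5, eps=1e-9, max_iter=1000):
    """Direction aware RWR scores; edges are (citing, cited) pairs."""
    refs, cites = defaultdict(list), defaultdict(list)
    for citing, cited in edges:
        refs[citing].append(cited)      # papers that `citing` refers to
        cites[cited].append(citing)     # papers that cite `cited`
    seeds = list(seeds)
    papers = set(refs) | set(cites) | set(seeds)
    r, r_source = dict.fromkeys(papers, 0.0), 1.0
    for _ in range(max_iter):
        new, new_source = dict.fromkeys(papers, 0.0), 0.0
        for v, mass in r.items():
            v_refs, v_cites = refs.get(v, []), cites.get(v, [])
            if not v_refs and not v_cites:
                new_source += mass      # assumption: nothing to walk to
                continue
            new_source += (1 - d) * mass
            if v_refs and v_cites:
                w_ref, w_cite = 1 - lam, lam
            else:                       # leak fix: one direction gets everything
                w_ref, w_cite = float(bool(v_refs)), float(bool(v_cites))
            for u in v_refs:            # C_t^-(v), Eq. (5)
                new[u] += d * w_ref * mass / len(v_refs)
            for u in v_cites:           # C_t^+(v), Eq. (4)
                new[u] += d * w_cite * mass / len(v_cites)
        for u in seeds:
            new[u] += r_source / len(seeds)
        delta = math.sqrt(sum((new[v] - r[v]) ** 2 for v in papers))
        r, r_source = new, new_source
        if delta < eps:
            break
    return r


# "new" cites "seed", which cites "old".
chain = [("seed", "old"), ("new", "seed")]
recent = darwr(chain, seeds=["seed"], lam=0.9)
classic = darwr(chain, seeds=["seed"], lam=0.1)
```

On this three-paper chain, a value of `lam` close to 1 ranks the citing paper above the cited one, and a value close to 0 reverses the order.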

3.1.3 Katz and direction awareness 

Direction awareness can also be adapted to other similarity
measures such as the graph-based Katz distance measure [15],
which was used before for citation recommendation
purposes [24]. With the Katz measure, the similarity score
between two papers $u, v \in V$ is computed as

$$Katz(u, v) = \sum_{i=1}^{L} \beta^{i} \, |paths^{i}_{u,v}|,$$

where $\beta \in [0, 1]$ is the decay parameter, $L$ is an integer
parameter, and $|paths^{i}_{u,v}|$ is the number of paths of
length $i$ between $u$ and $v$ in the graph with paper and
back-reference edges $G'' = (V, E \cup E_b)$. Notice that a
path does not need to be elementary, i.e., the path $uvuv$
is a valid path of length 3. Therefore the Katz measure
might not converge for all values of $\beta$ when $L = \infty$: $\beta$
needs to be chosen smaller than the reciprocal of the largest
eigenvalue of the adjacency matrix of $G''$. In practice, $L$ is set
to a fixed value (in our experiments, $L = 10$). In our context with
multiple seed papers, the relevance of a paper $v$ is set to
$R(v) = \sum_{p \in M} Katz(p, v)$.

We extend the Katz distance by using direction awareness
to weight the contributions to references and citations
differently with the $\lambda$ parameter as in DaRWR:

$$DaKatz(u, v) = \sum_{i=1}^{L} \left[ \lambda \beta^{i} \, |Rpaths^{i}_{u,v}| + (1 - \lambda) \beta^{i} \, |Cpaths^{i}_{u,v}| \right]$$

where $|Rpaths^{i}_{u,v}|$ (respectively, $|Cpaths^{i}_{u,v}|$) is the number
of paths in which the last edge in the path is a reference
edge of $E$ (respectively, a citation edge of $E_b$).
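Since paths need not be elementary, the path counts can be accumulated level by level without enumerating paths. The sketch below computes DaKatz scores from a single seed under that observation; the function name `dakatz` and its default parameter values are assumptions of this sketch, and multiplying the result for `lam = 0.5` by two gives the plain Katz score.

```python
from collections import defaultdict


def dakatz(edges, seed, beta=0.005, lam=0.5, L=10):
    """DaKatz(seed, v) for every reachable v; edges are (citing, cited) pairs."""
    refs, cites = defaultdict(list), defaultdict(list)
    for citing, cited in edges:
        refs[citing].append(cited)
        cites[cited].append(citing)
    walks = {seed: 1}                   # number of length-i walks ending at each paper
    score = defaultdict(float)
    for i in range(1, L + 1):
        via_ref, via_cite = defaultdict(int), defaultdict(int)
        for x, count in walks.items():
            for v in refs.get(x, ()):   # last edge is a reference edge of E
                via_ref[v] += count
            for v in cites.get(x, ()):  # last edge is a citation edge of E_b
                via_cite[v] += count
        walks = defaultdict(int)
        for v, count in via_ref.items():
            score[v] += lam * beta ** i * count
            walks[v] += count
        for v, count in via_cite.items():
            score[v] += (1 - lam) * beta ** i * count
            walks[v] += count
    return dict(score)


# "new" cites "seed", which cites "old".
chain = [("seed", "old"), ("new", "seed")]
one_step = dakatz(chain, "seed", beta=0.1, lam=0.75, L=1)
two_steps = dakatz(chain, "seed", beta=0.1, lam=0.75, L=2)
```

The cost is proportional to $L$ times the number of edges touched, which is one reason a fixed small $L$ is practical.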

3.2 Venue and Reviewer recommendation 

Given a VR query with inputs M and k, we execute 
the paper recommendation process and obtain the relevance 
scores of all papers in the database. The relevance score of 
each venue v is computed as the sum of relevance scores of 
all papers published in that venue, i.e., 



$$R(v) = \sum_{u \text{ is published in } v} R(u)$$

We then choose the top-$k$ venues with the highest relevance
scores as the suggestion set $\mathcal{R}_{venue}$.

Figure 4: Average shortest distance of top-10
recommendations by DaRWR from seed papers based
on the parameters $d$ and $\lambda$.

Figure 5: Average publication years of top-10
recommendations by DaRWR based on the parameters
$d$ and $\lambda$.

Similarly, given an ER query with inputs M and k, we 
execute the paper recommendation process and obtain the 
relevance scores of all papers in the database. The relevance 
score of each expert a is computed as the sum of relevance 
scores of all papers written by a, i.e., 

$$R(a) = \sum_{u \text{ is written by } a} R(u)$$

We then choose the top-$k$ researchers with the highest
relevance scores as the suggestion set $\mathcal{R}_{expert}$.
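Both aggregations have the same shape: the paper scores are summed per venue or per author and the $k$ largest totals are kept. A minimal sketch follows, in which the function name `aggregate_scores`, the paper scores and the venue and author labels are all made up for illustration.

```python
from collections import defaultdict


def aggregate_scores(paper_scores, groups_of, k):
    """Sum paper relevance per group (venue or author) and keep the top k."""
    totals = defaultdict(float)
    for paper, score in paper_scores.items():
        for group in groups_of.get(paper, ()):
            totals[group] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical relevance scores and metadata.
paper_scores = {"p1": 0.4, "p2": 0.3, "p3": 0.2}
venue_of = {"p1": ["VenueA"], "p2": ["VenueB"], "p3": ["VenueB"]}
authors_of = {"p1": ["alice", "bob"], "p2": ["bob"], "p3": ["carol"]}

top_venues = aggregate_scores(paper_scores, venue_of, k=1)
top_experts = aggregate_scores(paper_scores, authors_of, k=2)
```

Because the scores are summed rather than averaged, a venue or author with many moderately relevant papers can outrank one with a single highly relevant paper.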

4. EXPERIMENTS 

We carefully evaluate the accuracy of the proposed di- 
rection aware algorithms by comparing them with existing 
baselines and algorithms. Here, we give the details and re- 
sults of these experiments. 

4.1 Dataset collection 

The retrieval of bibliographic information and citation 
graph generation is a difficult task since academic papers 
are generally copyrighted and they are accessible through 
publishers' digital libraries. The right to use such data is
usually not explicitly granted; therefore, we limited our study
to data whose license is compatible with data mining.

We retrieved information about 1.75M (as of Dec 2011)
computer science articles from DBLP [12]. This data is
well-formatted and author names are disambiguated; however,
it does not contain any reference information. On the other
hand, CiteSeer contains reference information but most of
its data is automatically generated [1] and is often erroneous.
We mapped each document in CiteSeer to at most
one document in DBLP by using the title information (using
an inverted index on title words and Levenshtein distance)
and by their years. When two documents in CiteSeer map
to the same document in DBLP, their citation information
is merged. Of the 1,748,199 documents referenced in
DBLP, written by 1,028,288 authors, only 295,317 are properly
associated with a reference in CiteSeer. The graph
has 1,601,067 citation edges. Notice that a mapping between
CiteSeer data and DBLP data has been computed
before using canopy clustering with three times higher
coverage [20]. Although we could not match that much of the
data, we believe the data is sufficient to derive meaningful
conclusions.
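The matching step can be pictured with the following sketch, which retrieves candidates through an inverted index on title words and keeps the closest title by Levenshtein distance among the candidates with the same year. The distance threshold `max_dist`, the lowercase normalization, the exact year equality test, and all names and records below are assumptions made for illustration; the actual procedure may differ in these details.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(records):
    """Inverted index from title word to record keys."""
    index = defaultdict(set)
    for key, (title, _year) in records.items():
        for word in title.lower().split():
            index[word].add(key)
    return index


def match(title, year, records, index, max_dist=3):
    """Return the key of the closest same-year title, or None."""
    candidates = set()
    for word in title.lower().split():
        candidates |= index.get(word, set())
    best = None
    for key in candidates:
        cand_title, cand_year = records[key]
        if cand_year != year:
            continue
        dist = levenshtein(title.lower(), cand_title.lower())
        if dist <= max_dist and (best is None or dist < best[0]):
            best = (dist, key)
    return best[1] if best else None


# Hypothetical records: key -> (title, year).
records = {
    "d1": ("Direction awareness in citation recommendation", 2012),
    "d2": ("Random walks on graphs", 1993),
}
index = build_index(records)
found = match("Direction awarenes in citation recommendation", 2012, records, index)
```

The inverted index keeps the number of edit distance computations small, since only titles sharing at least one word with the query are compared.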



4.2 Citation recommendation experiments 

4.2.1 Parameter tests 

Before performing a comparison of the different methods
presented in the paper, we study the impact of the damping
factor $d$ and the direction awareness parameter $\lambda$ on the
recommendations given by the DaRWR algorithm. In
particular, we want to verify that changing these parameters
allows the user to obtain suggestions that are farther away
from the seed papers $M$ and to obtain suggestions that are
either recent or more traditional. To verify these effects, a
source paper published between 2005 and 2010 is randomly
selected and the paper's references are used as the seed
papers. We use the top-10 results as the set of suggestions.
The test is repeated 500 times.

Figure 4 shows the impact of parameters $d$ and $\lambda$ as a heat
map on the average shortest distance in the citation graph
between the recommended papers $\mathcal{R}_{paper}$ and the seed
papers $M$. When $d$ increases, the probability that the random
walk jumps back to the source node $s$ is reduced. Therefore,
the distant vertices are visited with higher probability
between two successive restarts, resulting in papers away
from $M$ being more likely to be in $\mathcal{R}_{paper}$. Figure 4 shows
that $\lambda$ makes little difference in the average distance to the
seed papers. However, setting a higher value of $d$ should
allow finding relevant papers whose relation to the seeds is
not obvious.

Figure 5 shows the impact of parameters $d$ and $\lambda$ on the
average year of the recommended papers in $\mathcal{R}_{paper}$ as a heat
map. Increasing the damping factor leads to earlier papers
since they tend to accumulate more citations. But for a
given $\lambda$, varying the damping factor does not allow reaching
a large diversity of time frames. The direction awareness
parameter $\lambda$ can be adjusted to reach papers from different
years, with a range from the late 1980s to 2010 for almost all
values of $d$. In our online service, the parameter $\lambda$ can be set
to a value of the user's preference. It allows the user to obtain
recent papers by setting $\lambda$ close to 1 or to find older papers
by setting $\lambda$ close to 0.

Overall, first-level papers are often returned for $d < 0.8$;
yet many papers at distance 2 and more appear. Also, it
is possible to choose between traditional papers (by setting
$\lambda < 0.4$) or recent papers (by setting $\lambda > 0.8$) thanks to the
direction awareness parameter.

4.2.2 Experimental settings 




Figure 7: Accuracy of DaRWR method with different $\lambda$ and $d$ parameters on different experiments:
(a) hide random, (b) hide recent, (c) hide earlier, (d) future prediction.



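Table 6: Parameters used in the experiments.

Method         | Random                                 | Recent                                  | Earlier                                        | Future
Katz$_\beta$   | $\beta = 0.0005$ (all scenarios)
DaKatz         | $\beta = 0.005$, $\lambda = 0.25$      | $\beta = 0.0005$, $\lambda = 0.75$      | $\beta = 0.0005$, $\lambda =$ [illegible]      | $\beta = 0.005$, $\lambda = 0.25$
PaperRank      | $d = 0.5$                              | $d = 0.9$                               | $d = 0.9$                                      | $d = 0.75$
DaRWR          | $\lambda = 0.5$, $d = 0.75$            | $\lambda = 0.9$, $d = 0.5$              | $\lambda = 0.1$, $d = 0.5$                     | $\lambda = 0.5$, $d = 0.75$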


We test the quality of the recommended citations by dif- 
ferent methods in four different scenarios. 

The hide random scenario represents the typical use-case
where a researcher is writing a paper and trying to find
some more references. To simulate that, a source paper $s$
with enough references ($\deg^{+}(s) > 20$) is randomly selected
from the papers published between 2005 and 2010. Then
we remove $s$ and all the papers published after $s$ from the
graph (i.e., $G_s = (V_s, E_s)$ where $V_s \subseteq V \setminus \{s\}$ and
$\forall v \in V_s$, $year[v] < year[s]$), simulating the time when $s$ was being
written. Out of the $\deg^{+}(s)$ references, 10% are randomly
put in the hidden set $H$, and the rest are used as the seed
papers (i.e., $M = \{v \notin H : (s, v) \in E\}$). We compute
the citation recommendations on $M$ and report the average
accuracy of finding hidden papers within the top $\deg^{+}(s)$
recommendations for 500 independent queries.
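One way to set up a single hide random query is sketched below. The names `hide_random_query` and `accuracy`, the rounding of the 10% share (with at least one hidden paper), and the toy data are assumptions of this sketch; the accuracy is taken here as the fraction of hidden papers found among the returned recommendations.

```python
import random


def hide_random_query(edges, year, source, hide_frac=0.1, rng=random):
    """Split the references of `source` into hidden and seed sets and
    build the graph G_s of papers published before `source`."""
    refs = sorted(cited for citing, cited in edges if citing == source)
    n_hidden = max(1, round(hide_frac * len(refs)))   # assumption: at least one
    hidden = set(rng.sample(refs, n_hidden))
    seeds = [v for v in refs if v not in hidden]
    keep = {v for v in year if v != source and year[v] < year[source]}
    graph = [(x, y) for x, y in edges if x in keep and y in keep]
    return graph, seeds, hidden


def accuracy(recommended, hidden):
    """Fraction of hidden papers that appear among the recommendations."""
    return len(set(recommended) & set(hidden)) / len(hidden)


# Toy data: source "s" (2008) cites r0..r19 (2000); "x" (2009) is newer.
edges = [("s", f"r{i}") for i in range(20)] + [("r0", "r1"), ("x", "r2")]
year = {f"r{i}": 2000 for i in range(20)}
year.update({"s": 2008, "x": 2009})
graph, seeds, hidden = hide_random_query(edges, year, "s", rng=random.Random(0))
```

A recommender is then run on `graph` with `seeds`, and its top recommendations (as many as `s` has references) are passed to `accuracy` together with `hidden`.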

The hide recent scenario represents another typical use-case
where the author might be well aware of the literature of her
field but might have missed some recent developments. It
differs from hide random in how the references are hidden. Here,
the references that are put in $H$ are not chosen randomly.
They are the most recent references. Again, the average
accuracy of finding hidden papers within the top $\deg^{+}(s)$
recommendations is reported for each source $s$.

In the hide earlier scenario, the author is interested in
finding some key papers related to the field. This scenario is
exactly the opposite of hide recent, i.e., the hidden papers
are the oldest publications. The average accuracy of finding
those hidden traditional papers within the top $\deg^{+}(s)$
recommendations is reported for each source $s$.

The future prediction scenario investigates the accuracy of
a recommendation system while providing a link between
two papers which are not known to be related yet. It verifies
whether the algorithm can predict which papers will be cited
together with a given paper. For this test, the source paper $s$ is selected
similarly. However, the graph selected for the recommendation
includes paper $s$ but excludes all subsequent papers (i.e.,
$G_s = (V_s, E_s)$ with $\forall v \in V_s$, $year[v] \leq year[s]$). All
the references of $s$ are used as the seeds to obtain
top-10 recommendations. The accuracy of the algorithm
is estimated by counting how many of the documents that
appear in the top-10 are later co-cited with the source paper.

The methods we proposed are compared on these scenarios
against widely-used citation based approaches:
bibliographic coupling [9], Cocitation [23], CCIDF [11],
PaperRank [5] and the original Katz distance [15]. The algorithms
and the parameters that lead to the best accuracy in the
different experiments are summarized in Table 6.

4.2.3 Results 

Figure 7 presents the accuracy obtained by DaRWR
for different combinations of the parameters $d$ and $\lambda$ on the
four scenarios. The results show that extreme values of the
parameters are typically not the ones that obtain the highest
accuracy. On the hide random experiment, DaRWR
performs best with $d = 0.75$ and $\lambda = 0.5$. A similar
combination ($d = 0.75$, $\lambda = 0.9$) obtains a high accuracy on
the hide recent experiment; however, that experiment is best
solved with parameters $d = 0.5$ and $\lambda = 0.9$. As expected, the hide
earlier experiment is best solved using a low value of the
direction awareness parameter ($d = 0.5$, $\lambda = 0.1$). The future
prediction experiment is best solved by the $d = 0.75$, $\lambda = 0.5$
parameter set. Still, using $d = 0.5$ leads to solutions of
reasonable accuracy. It is interesting to notice that the hide
random and future prediction experiments show similar
patterns while the hide recent and hide earlier experiments show
opposite patterns. This experiment tells us that it is enough
to make $\lambda$ tunable in the service, since tuning $d$ has little
impact once it is set to a reasonable value. Most likely,
making $d$ tunable would only add complexity with no
significant improvement in the accuracy.

Figure 8 presents a comparison of all the methods on the
same scenarios. Many algorithms are represented as
horizontal lines since they are not direction aware. The first
remark is that Cocoupling and CCIDF perform poorly on
all four scenarios. Cocitation performs the worst in the hide
recent scenario and performs reasonably well, but not the
best, in the other three scenarios. These methods, which only
consider counting and weighting edges at distance 2 at most
from the seeds, are outperformed by the eigenvector based
methods, which take the whole graph into account.

Figure 8: Accuracy of the algorithms on (top left) hide random, (top right) hide recent, (bottom left) hide
earlier, and (bottom right) future prediction experiments based on $\lambda$ and other parameters. Note that the
accuracy of Katz is equal to DaKatz at $\lambda = 0.5$.



Notice that PaperRank performs well overall but for
different values of the damping parameter $d$. The performance
of DaKatz varies significantly with the parameter set,
but it is important to notice that the variations with the
direction awareness parameter are similar to the ones observed
on DaRWR. The results of Katz are not explicitly
presented but can be read on DaKatz when $\lambda = 0.5$. Notice
that DaKatz is always a better method than Katz.
PaperRank achieves the best results when the query is generic (on
the hide random and future prediction scenarios); however,
direction aware methods lead to higher accuracy when the
query is specific.

The previous experiments show that the methods we propose
return results of higher accuracy. However, these results do
not allow us to understand whether the methods return similar
or different results. Table 9 presents the intersection
matrix of the different methods on four scenarios. Each
method's parameters are set to optimize the accuracy. The
diagonal of the matrix shows the actual accuracy of the
methods. The other values show the percentage of the
intersection of the two corresponding methods. For instance,
one can read that in the hide random scenario, PaperRank has
an accuracy of 51.30% while CCIDF has an accuracy of 20.12%.
The intersection between the results of CCIDF and PaperRank
has an accuracy of 17.23%, indicating that most of the
relevant results returned by CCIDF were also returned by
PaperRank in that scenario. In the hide recent and hide
random scenarios, the proposed methods clearly dominate the
solution space. The other methods do not add many new
relevant suggestions.
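
The way such an intersection matrix is read can be illustrated with a minimal sketch. It assumes that accuracy is the percentage of hidden papers that appear in a method's result list, which is an assumption here since the accuracy definition is given elsewhere in the paper. The function name `intersection_matrix` and the method names `X` and `Y` are hypothetical, and only a single query is handled; averaging over queries is omitted.

```python
def intersection_matrix(results, hidden):
    """Return {(a, b): percentage of hidden papers found by both a and b}.

    results: dict mapping a method name to its list of suggested papers.
    hidden:  set of papers removed from the query (the ground truth).
    The diagonal entry (a, a) is the assumed accuracy of method a.
    """
    methods = sorted(results)
    matrix = {}
    for a in methods:
        for b in methods:
            # Relevant suggestions that both methods returned.
            common_hits = set(results[a]) & set(results[b]) & hidden
            matrix[a, b] = 100.0 * len(common_hits) / len(hidden)
    return matrix


# Toy data: two hypothetical methods, four hidden papers.
hidden = {"p1", "p2", "p3", "p4"}
results = {"X": ["p1", "p2", "q1"], "Y": ["p2", "p3", "q1"]}
matrix = intersection_matrix(results, hidden)
```

When an off-diagonal entry is close to the smaller of the two diagonal entries, the weaker method contributes few relevant suggestions that the stronger one misses.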

The case of the future prediction scenario is different.
The intersection between the different methods often
highlights that a significant portion of the returned
suggestions differs between the algorithms. For instance, the
intersection between DaRWR and Cocoupling scores an accuracy
of 5.68%, which is 5 times smaller than the accuracy of
Cocoupling (25.22%) and 7.5 times smaller than the accuracy
of DaRWR (39.08%).

4.2.4 Citation patterns 

To better understand the difference in the accuracy obtained
by the different methods, we studied the properties of the
suggestions returned by the methods and compared them to the
properties of the actual references within the papers. We
argue that highly relevant suggested papers should have
patterns similar to those of the actual references.

One feature to measure the citation patterns is the
clustering coefficient [26]. The clustering coefficient $C_v$
of paper $v$ is computed as:

$$C_v = \frac{|\{(i,j) \in E \mid i,j \in N_v \cup \{v\}\}|}{|N_v| \times (|N_v| + 1)}$$

where $N_v$ is the set of neighbor papers of $v$, which either
cite $v$ or are cited by $v$. Intuitively, the clustering
coefficient indicates how close a vertex and its neighbors
are to forming a clique.
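
As an illustration, the clustering coefficient defined above can be computed directly from an edge list. This is a minimal sketch, not the authors' implementation: it assumes E is a set of directed (citing, cited) pairs, and the function name and the convention of returning 0 for a paper without neighbors are assumptions.

```python
def clustering_coefficient(edges, v):
    """Clustering coefficient of paper v.

    edges: iterable of directed citation pairs (citing, cited).
    Counts the edges whose endpoints both lie in N_v plus v itself, and
    divides by |N_v| * (|N_v| + 1), the number of ordered pairs in that set.
    """
    edge_set = set(edges)
    # N_v: papers that cite v or are cited by v.
    neighbors = set()
    for i, j in edge_set:
        if i == v and j != v:
            neighbors.add(j)
        elif j == v and i != v:
            neighbors.add(i)
    if not neighbors:
        return 0.0  # assumption: a paper with no neighbors gets 0
    closed = neighbors | {v}
    inside = sum(1 for i, j in edge_set
                 if i in closed and j in closed and i != j)
    return inside / (len(neighbors) * (len(neighbors) + 1))


# Toy graph: a cites b and c, and b cites c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c")]
coefficient = clustering_coefficient(toy_edges, "a")
```

Under this reading the denominator counts ordered pairs, so an acyclic citation graph, which has at most one edge per pair of papers, cannot exceed 0.5.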



Table 9: Intersection matrix of the results for (i) hide random, (ii) hide recent, (iii) hide earlier, and (iv)
future prediction experiments.

The other metric we consider is the PageRank of a vertex,
which can be calculated by putting all vertices in M
during the PaperRank algorithm.
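
The relation between the two computations can be sketched as follows, assuming PaperRank is a random walk with restart to the seed set M with damping parameter d. The function name, the choice to follow citing-to-cited edges, the handling of papers without references, and the default d = 0.85 are all assumptions made for this illustration.

```python
def random_walk_with_restart(out_links, seeds, d=0.85, iters=100):
    """Power iteration for a random walk that restarts uniformly in `seeds`.

    out_links: dict mapping every paper to the list of papers it cites.
    seeds:     the set M; using all papers as seeds yields PageRank.
    """
    nodes = list(out_links)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - d) * restart[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = d * p[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Assumption: a paper without references sends its
                # probability mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += d * p[n] * restart[m]
        p = nxt
    return p


# Toy graph: a and b both cite c. Seeding with every paper gives PageRank.
toy_graph = {"a": ["c"], "b": ["c"], "c": []}
pagerank = random_walk_with_restart(toy_graph, seeds=set(toy_graph))
```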

Figure 10 presents the cumulative distribution function of
the clustering coefficient and of the PageRank of the
documents suggested by each algorithm and of the hidden
papers in the three hide scenarios. The first observation is
that on all charts the Cocitation algorithm is an outlier.
Also, CCIDF and Cocoupling are almost indistinguishable on
all charts. Interestingly, the clustering coefficient of the
hidden papers in the hide earlier scenario is lower than in
the hide random scenario, and the clustering coefficient of
the hidden papers in the hide recent scenario is the highest.
The trend is reversed with PageRank. Older papers have had
more time to become famous, so their PageRank is higher. And
since they have more citations, it is less likely that their
neighbors are close to forming a clique. This highlights that
papers published in different years have different profiles,
bolstering our claim that one should not use the very same
algorithm and parameters to look for them.
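
How close two such cumulative curves are can be quantified with a small sketch. The largest vertical gap between two empirical distribution functions (the two-sample Kolmogorov-Smirnov statistic) is used here purely as an illustration; it is not a measure taken from this paper, and the function names are hypothetical.

```python
import bisect


def cdf_at(sorted_values, x):
    """Fraction of the sample that is less than or equal to x."""
    return bisect.bisect_right(sorted_values, x) / len(sorted_values)


def max_cdf_gap(sample_a, sample_b):
    """Largest vertical distance between the two empirical CDFs.

    The maximum is attained at one of the sample points, so checking
    those points is sufficient.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    return max(abs(cdf_at(a, x) - cdf_at(b, x)) for x in a + b)
```

A gap near 0 between, say, the clustering coefficients of a method's suggestions and those of the hidden papers would correspond to the "similar trace" reading of the charts.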

For the hide random scenario, PaperRank suggests papers
with a clustering coefficient very similar to that of the
hidden papers, while DaRWR, Katz, and DaKatz show a different
but parallel trace. The PageRank distribution of the
algorithms shows a similar picture, except that Katz is close
to the hidden papers and PaperRank, DaRWR, and DaKatz are
farther away.

In the hide recent scenario, most algorithms have a similar
trace for both the clustering coefficient and PageRank.
PaperRank and Katz are significantly different from their
direction aware variants and from the trace of the hidden
papers. Recall that PaperRank and Katz are also less accurate
than their direction aware variants in the hide recent
scenario. Having a similar trace is an important property,
but it is not enough to reach a high accuracy. Indeed,
Cocoupling and CCIDF show a trace similar to that of the
hidden papers in that scenario but with lower accuracy.

In the hide earlier scenario, the direction aware algorithms
have patterns similar to the hidden papers for both metrics,
explaining the high accuracy they reach. PaperRank has a
PageRank pattern similar to the hidden papers but a different
clustering coefficient pattern, and it does not reach the
high accuracy level the direction aware algorithms obtain.
Katz's pattern is similar to that of the hidden papers in
neither clustering coefficient nor PageRank, and it is the
one with the lowest accuracy among all the eigenvector based
methods.

This analysis shows that direction aware algorithms have
overall similar citation patterns. CCIDF and Cocoupling
typically have similar citation patterns. The difference in
accuracy of the eigenvector based methods can be explained
by the similarity in citation patterns between the papers one
is looking for and what is generated by the method. The
direction aware methods are more flexible and can be tuned to
match the properties of the query, leading to higher accuracy.
The reasons for the success or failure of the non-eigenvector
based methods (Cocitation, Cocoupling, and CCIDF) seem to be
unrelated to the citation pattern metrics we considered.

4.3 Relevance feedback experiments 

Relevance feedback is an important part of the
recommendation system since users may give positive and
negative feedback on the results in order to reach the
desired papers or topics. In this test, 500 source papers
are randomly selected, and for each source paper s the graph
is pruned by removing the papers published after s. Then, a
target paper t is selected from the pruned graph, such that
it is the most relevant paper at distance 5 from s. Assuming
that a user can only display 10 results at a time, we measure
the number of pages that the user has to go through until she
reaches t. We compare the feedback mechanism under the
following idealized user behaviors:

No feedback: There is no feedback mechanism; therefore, the
user keeps looking at the next page until she finds the
target paper.

Only positive feedback: Results are either labeled as
relevant and added to M in the next step, or not displayed
again.

Only negative feedback: Results are either labeled as
irrelevant and removed from the graph, or not displayed
again.

Both positive and negative: Results are labeled as either
relevant, to be added to M, or irrelevant, to be removed
from the graph.

Figure 10: Clustering coefficient (top) and PageRank (bottom) of the suggested citations for the hide earlier
(left), hide random (center), and hide recent (right) experiments.



Detailed results for that experiment are omitted. Using
only negative feedback reduces the number of pages one has
to go through by 82.29% on average, and using only positive
feedback reduces the number of pages by 97.15% on average.
Using both negative and positive feedback reduces the number
of pages by 97.20% on average. This result shows that the
feedback mechanism significantly speeds up the process of
searching for specific references.
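
The page-count measurement used in this experiment can be sketched as a simple loop. This is only an illustration under stated assumptions: `rank` stands for any recommender that takes the seed set M and the set of removed papers and returns a ranked list, and `is_relevant` stands for the idealized user's judgment. Both callables, the function name, and the `max_pages` cutoff are hypothetical.

```python
def pages_until_target(rank, seeds, target, is_relevant,
                       mode="none", page_size=10, max_pages=1000):
    """Number of result pages viewed until `target` is displayed.

    mode is one of "none", "positive", "negative", or "both".
    Returns None if the target is never displayed.
    """
    seeds = set(seeds)
    removed = set()  # papers dropped from the graph by negative feedback
    shown = set()    # papers already displayed are never displayed again
    for page in range(1, max_pages + 1):
        ranking = [p for p in rank(seeds, removed)
                   if p not in shown and p not in seeds]
        results = ranking[:page_size]
        if not results:
            return None
        if target in results:
            return page
        shown.update(results)
        for p in results:
            if is_relevant(p):
                if mode in ("positive", "both"):
                    seeds.add(p)    # relevant result joins M
            elif mode in ("negative", "both"):
                removed.add(p)      # irrelevant result leaves the graph
    return None
```

The reduction reported above would then correspond to comparing the value returned for a feedback mode with the value returned for `mode="none"` on the same source and target.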

4.4 Venue and reviewer recommendation experiments

The venue recommendation method is tested on the assumption
that a paper is published in a venue where it is relevant.
The following protocol relies on this assumption. A source
paper is randomly selected and is removed from the graph
along with all subsequent papers. The objective is to find
the venue of the source paper in R_venue, a list containing
k = 10 venues. We compare the performance of our methods
against a method commonly employed by researchers, which
consists of considering the top-10 most occurring venues of
the papers of interest, i.e., the set M. We call this
algorithm Baseline 1. Another algorithm, Baseline 2,
considers the venues of the papers at distance 2 from the
source paper: it returns the top-10 most occurring venues in
M and in the references and citations of these documents.
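
The two baselines described above amount to counting venues, which can be sketched as follows. The function names, the `venue_of` and `neighbors` dictionaries, and the choice to count each paper once in Baseline 2 are assumptions made for this illustration; the venue labels in the toy data are placeholders.

```python
from collections import Counter


def baseline1(seed_papers, venue_of, k=10):
    """Top-k most frequent venues among the papers in M."""
    counts = Counter(venue_of[p] for p in seed_papers if p in venue_of)
    return [venue for venue, _ in counts.most_common(k)]


def baseline2(seed_papers, venue_of, neighbors, k=10):
    """Top-k most frequent venues among M and the papers that the
    papers in M cite or are cited by (assumption: each paper counted once)."""
    pool = set(seed_papers)
    for p in seed_papers:
        pool.update(neighbors.get(p, ()))
    counts = Counter(venue_of[p] for p in pool if p in venue_of)
    return [venue for venue, _ in counts.most_common(k)]


# Toy data: three seed papers and their neighbors.
venue_of = {"m1": "V1", "m2": "V1", "m3": "V2",
            "n1": "V2", "n2": "V2", "n3": "V3"}
neighbors = {"m1": ["n1", "n2"], "m3": ["n3"]}
seeds = ["m1", "m2", "m3"]
top_b1 = baseline1(seeds, venue_of)
top_b2 = baseline2(seeds, venue_of, neighbors)
```

A recommendation is then counted as correct when the venue of the removed source paper appears in the returned list.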

Table 11: Average accuracy of venue recommendation (VR) and
reviewer recommendation (RR) experiments.

PaperRank  56.0  48.38  60.0  44.04

The reviewer recommendation experiment is based on the
assumption that "the authors are the best reviewers for the
paper" (ignoring the obvious conflict of interest, and with
best reviewers referring to people who have enough knowledge
of the candidate paper). The experiment is conducted
similarly to the venue recommendation experiment. A source
paper is selected and is removed from the graph along with
all subsequent papers. For a list R_expert which contains
k = 25 experts, we distinguish whether none, any, or all of
the authors of the source paper are found. Both baselines are
defined in the same way as in the venue recommendation
experiment.
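
The none/any/all distinction in this evaluation can be made concrete with a short sketch. The function name is hypothetical, and treating "any" as "at least one author found" (so that it also holds when all authors are found) is an assumption.

```python
def author_hits(recommended, authors, k=25):
    """Classify one reviewer recommendation against the true authors.

    recommended: ranked list of suggested experts.
    authors:     non-empty collection of the source paper's authors.
    """
    found = set(recommended[:k]) & set(authors)
    return {
        "none": not found,               # no author was suggested
        "any": bool(found),              # at least one author was suggested
        "all": found == set(authors),    # every author was suggested
    }
```

For example, `author_hits(["x", "y", "z"], ["x", "q"])` reports that one author was found but not all of them.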

Table 11 presents the average accuracy of these methods
when run on 500 source papers selected uniformly at random.
For venue recommendation, the three proposed methods perform
better than Baseline 1, and DaRWR performs better than
Baseline 2. The differences are marginal (less than 10%) but
statistically significant. For reviewer recommendation,
DaRWR performs the best. Interestingly, Baseline 2 performs
worse than Baseline 1 in both experiments.

5. CONCLUSION AND FUTURE WORK 

In this paper, we present direction aware algorithms for
citation recommendation. A direction aware model allows the
search to be tuned toward more recent or more traditional
documents. We developed two algorithms based on the
direction aware model, namely DaKatz and DaRWR. We also
suggest using the classical random walk with restart
(PaperRank) for academic recommendation. Experimentally, we
confirmed that the parameters can be easily set to browse
the academic web of knowledge. In our experiments that focus
on finding either traditional or recent papers, the
direction aware algorithms we propose outperform the existing
citation recommendation algorithms that are based only on
the citation graph. We implemented the algorithms in our
web service, which allows any researcher to upload a
bibliography file and obtain suggestions. This service is
freely available and easy to use. Coupled with our efficient
algorithms, we believe that our service will become a tool
of major interest for researchers.

As future work, we want to improve our service both in
theory and in practice. We are planning to test weighting
schemes on edges to better distribute probability toward
high-quality papers. In practice, we will improve the amount
and the quality of the bibliographic data by using existing
techniques such as canopy clustering and by obtaining data
from more public academic databases. We are also planning to
conduct an intensive user study to obtain a real-world
evaluation of the system.